CLAIMS 

What is claimed is: 

1. A method for retrieving information using a search engine comprising the steps of:

(a) retrieving a document to be indexed; 

(b) generating a document extract corresponding to the document; 

(c) decomposing the document extract into a plurality of tokens; and 

(d) storing the plurality of tokens in a search index, wherein the search engine 
accesses the search index to retrieve information in one or more document extracts satisfying 
a search query. 

2. The method of claim 1, wherein the generating step (b) further comprises the
steps of: 

(b1) extracting a portion of the document that characterizes the document's
subject content to form the document extract; and 

(b2) recording positional information of the portion extracted within the 
document. 

3. The method of claim 2, further comprising the step of: 
(e) storing the document extract in a storage device. 

4. The method of claim 3, wherein the storing step (d) further comprises: 
(d1) storing the recorded positional information with the plurality of tokens.

5. The method of claim 4, wherein the extracting step (b1) further comprises the step of:

(b1i) extracting from the document a collection of sentences that are
characteristic of the document's subject content to form a document 
summary. 

6. The method of claim 4, wherein the decomposing step (c) further comprises: 
(c1) selecting from the document extract one of a whole sentence, a
portion of a sentence, a word, and a feature.

7. The method of claim 6, wherein the selecting step (c1) further comprises:

(c1i) selecting based on frequency of occurrence,
word-salient-measure, proximity to the beginning of a paragraph, proximity
to the beginning of the document, and proximity to or position within a heading
or a caption.

8. The method of claim 1, wherein the document is a web-page in the Internet.

9. A computer readable medium containing programming instructions for 
retrieving information using a search engine comprising the instructions for: 

(a) retrieving a document to be indexed;

(b) generating a document extract corresponding to the document; 

(c) decomposing the document extract into a plurality of tokens; and 

(d) storing the plurality of tokens in a search index, wherein the search engine 
accesses the search index to retrieve information in one or more document extracts satisfying 
a search query. 

10. The computer readable medium of claim 9, wherein the generating instruction 
(b) further comprises the instructions for: 

(b1) extracting a portion of the document that characterizes the document's
subject content to form the document extract; and 

(b2) recording positional information of the portion extracted within the 
document. 

11. The computer readable medium of claim 3, further comprising the instruction for:

(e) storing the document extract in a storage device. 



12. The computer readable medium of claim 11, wherein the storing instruction
(d) further comprises the instruction for: 

(d1) storing the recorded positional information with the plurality of tokens.

13. The computer readable medium of claim 12, wherein the extracting
instruction (b1) further comprises the instruction for:

(b1i) extracting from the document a collection of sentences that are
characteristic of the document's subject content to form a document 
summary. 

14. The computer readable medium of claim 12, wherein the decomposing 
instruction (c) further comprises the instruction for: 

(c1) selecting from the document extract one of a whole sentence, a
portion of a sentence, a word, and a feature. 

15. The computer readable medium of claim 14, wherein the selecting instruction 
(c1) further comprises the instruction for:

(c1i) selecting based on frequency of occurrence,
word-salient-measure, proximity to the beginning of a paragraph, proximity
to the beginning of the document, and proximity to and position within a
heading and a caption. 

16. The computer readable medium of claim 9, wherein the document is a
web-page in the Internet. 

17. A system for retrieving information, wherein the system includes a search
engine comprising:

means for retrieving a document from a document repository; 

an information extractor coupled to the means for retrieving, wherein the information 
extractor generates a document extract corresponding to the document; 

a storage device coupled to the information extractor for storing the document
extract;

a search engine indexer coupled to the storage device for decomposing the document 
extract into a plurality of tokens; and 

a search index coupled to the search engine indexer for storing the plurality of 
tokens, wherein the search engine accesses the search index to retrieve information in one or 
more document extracts satisfying a search query.

18. The system of claim 17, wherein the information extractor extracts a portion
of the document that characterizes the document's subject content to form the document 
extract, and records positional information of the portion extracted within the document. 

19. The system of claim 18, wherein the search index stores the positional
information associated with the plurality of tokens. 

20. The system of claim 19, wherein a token of the plurality of tokens comprises
one of a whole sentence, a portion of a sentence, a word, and a feature of the document. 

21. The system of claim 20, wherein the search engine indexer selects the
plurality of tokens based on frequency of occurrence, word-salient-measure, proximity to the
beginning of a paragraph, proximity to the beginning of the document, and proximity to and
position within a heading and a caption. 

22. The system of claim 17, wherein the document repository is the Internet and
the document is a web-page. 

23. The system of claim 22, wherein the means for retrieving the document is a 
web crawler. 
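
For illustration only, the steps recited in claims 1 through 4 (generating an extract, recording positional information, decomposing the extract into tokens, and storing the tokens in a search index) might be sketched as follows. This is a minimal sketch and not the claimed implementation: the function names, the "leading sentences" extraction heuristic, the character-offset form of the positional information, and the in-memory dictionaries are all assumptions introduced here.

```python
import re
from collections import defaultdict


def generate_extract(document, max_sentences=2):
    """Steps (b1)/(b2): take the leading sentences as the extract (an assumed
    heuristic) and record each sentence's character offset in the document."""
    extract = []
    for match in re.finditer(r"[^.!?]+[.!?]", document):
        raw = match.group()
        # Skip leading whitespace so the offset points at the first character.
        offset = match.start() + len(raw) - len(raw.lstrip())
        extract.append((raw.strip(), offset))
        if len(extract) == max_sentences:
            break
    return extract


def tokenize(extract):
    """Step (c): decompose the extract into word tokens, keeping the position
    of each token within the original document."""
    for sentence, offset in extract:
        for word in re.finditer(r"\w+", sentence):
            yield word.group().lower(), offset + word.start()


def build_index(documents):
    """Steps (d), (d1), (e): store tokens with positions in a search index and
    keep the extracts themselves in a separate store."""
    index = defaultdict(list)
    extracts = {}
    for doc_id, text in documents.items():
        extract = generate_extract(text)
        extracts[doc_id] = extract
        for token, position in tokenize(extract):
            index[token].append((doc_id, position))
    return index, extracts


def search(index, query):
    """Return the ids of documents whose extract contains the query word."""
    return sorted({doc_id for doc_id, _ in index.get(query.lower(), [])})
```

Because only the extract is tokenized, a word that appears solely outside the extracted portion of a document is not indexed, which keeps the index smaller than a full-text index at the cost of recall.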



DE920000094US1/2265P 


